spark

Loading External Data

For this project, we chose to incorproate two additional datasets. We hope these external datasets can contextualize some of the results we see in the reddit comments.

The first external dataset is the weekly rankings of collegiate teams. We hope this provides some context from impartial observers who follow collegiate basketball and can provide some context as to how good – or bad – teams are compared to other national teams.

The second external dataset is the game-by-game results from the two teams we chose to focus on - UNC and Kansas. This will help provide us more information on how these teams are doing week in and week out.

Below, we load that data here.

import pandas as pd
import requests
import matplotlib as plt
import plotly.express as px
url = 'https://www.sports-reference.com/cbb/seasons/men/2022-polls.html'
rankings = pd.read_html(url, header=2)[0]
rankings = rankings.drop('Conf', axis=1)

rankings_transpose = rankings.T
rankings_transpose = rankings_transpose.reset_index(drop=True)

new_header = rankings_transpose.iloc[0] #grab the first row for the header
rankings_transpose = rankings_transpose[1:] #take the data less the header row
rankings_transpose.columns = new_header

for column in rankings_transpose.columns:
    rankings_transpose[column] = pd.to_numeric(rankings_transpose[column], errors='coerce')
rankings_transpose.dtypes

rankings_transpose = rankings_transpose.reset_index(drop=True)
rankings_transpose = rankings_transpose.drop(rankings_transpose.columns[20], axis=1)

rankings_transpose
Alabama Arizona Arkansas Auburn BYU Baylor Boise State Colorado State Davidson Duke ... Texas Tech UCLA UConn UNC USC Villanova Virginia Wisconsin Wyoming Xavier
0 14.0 NaN 16.0 22 NaN 8 NaN NaN NaN 9 ... NaN 2 24.0 19.0 NaN 4 25.0 NaN NaN NaN
1 14.0 NaN 16.0 21 NaN 9 NaN NaN NaN 7 ... NaN 2 23.0 18.0 25.0 5 NaN NaN NaN NaN
2 10.0 17.0 13.0 19 18.0 6 NaN NaN NaN 5 ... NaN 2 22.0 NaN 24.0 7 NaN NaN NaN 25.0
3 16.0 11.0 10.0 21 12.0 4 NaN NaN NaN 1 ... NaN 5 17.0 NaN 20.0 6 NaN 23.0 NaN NaN
4 9.0 11.0 12.0 18 24.0 2 NaN NaN NaN 3 ... NaN 4 15.0 NaN 16.0 6 NaN 22.0 NaN NaN
5 6.0 8.0 24.0 13 NaN 1 NaN 23.0 NaN 2 ... 25.0 4 20.0 NaN 10.0 9 NaN NaN NaN 22.0
6 10.0 6.0 NaN 12 NaN 1 NaN 21.0 NaN 2 ... 25.0 5 NaN NaN 8.0 23 NaN 24.0 NaN 18.0
7 19.0 9.0 NaN 11 NaN 1 NaN 20.0 NaN 2 ... 25.0 5 NaN NaN 7.0 22 NaN 24.0 NaN 23.0
8 15.0 8.0 NaN 9 NaN 1 NaN 20.0 NaN 2 ... 25.0 5 NaN NaN 7.0 19 NaN 23.0 NaN 22.0
9 24.0 6.0 NaN 4 NaN 1 NaN NaN NaN 8 ... 19.0 3 NaN NaN 5.0 14 NaN 13.0 NaN 17.0
10 NaN 3.0 NaN 2 NaN 5 NaN NaN NaN 6 ... 18.0 9 25.0 NaN 16.0 11 NaN 8.0 NaN 20.0
11 NaN 3.0 NaN 1 NaN 4 NaN NaN 25.0 9 ... 13.0 7 20.0 NaN 15.0 14 NaN 11.0 NaN 21.0
12 NaN 7.0 NaN 1 NaN 8 NaN NaN NaN 9 ... 14.0 3 17.0 NaN 19.0 12 NaN 11.0 NaN 21.0
13 NaN 4.0 NaN 1 NaN 10 NaN NaN NaN 7 ... 9.0 12 24.0 NaN 21.0 15 NaN 14.0 NaN 25.0
14 25.0 3.0 23.0 2 NaN 7 NaN NaN NaN 9 ... 11.0 13 24.0 NaN 17.0 10 NaN 15.0 22.0 NaN
15 24.0 2.0 18.0 3 NaN 10 NaN NaN NaN 7 ... 9.0 12 21.0 NaN 16.0 8 NaN 13.0 NaN NaN
16 25.0 2.0 14.0 5 NaN 3 NaN NaN NaN 4 ... 12.0 17 18.0 NaN 16.0 11 NaN 10.0 NaN NaN
17 NaN 2.0 15.0 4 NaN 3 NaN 23.0 NaN 7 ... 14.0 13 20.0 25.0 21.0 8 NaN 12.0 NaN NaN
18 NaN 2.0 17.0 8 NaN 4 23.0 24.0 NaN 9 ... 12.0 11 21.0 NaN 22.0 6 NaN 14.0 NaN NaN

19 rows × 47 columns

url_kansas = 'https://www.sports-reference.com/cbb/schools/kansas/men/2022-schedule.html'
kansas_schedule = pd.read_html(url_kansas, header=0)[1]
kansas_schedule = kansas_schedule[~kansas_schedule['Date'].str.contains('date', case=False, na=False)]
columns_keep = ['Date', 'Opponent', 'Unnamed: 8', 'Tm', 'Opp', 'OT', 'W', 'L', 'Streak']
kansas_schedule = kansas_schedule[columns_keep]
kansas_schedule
Date Opponent Unnamed: 8 Tm Opp OT W L Streak
0 Tue, Nov 9, 2021 Michigan State W 87 74 NaN 1 0 W 1
1 Fri, Nov 12, 2021 Tarleton State W 88 62 NaN 2 0 W 2
2 Thu, Nov 18, 2021 Stony Brook W 88 59 NaN 3 0 W 3
3 Thu, Nov 25, 2021 North Texas W 71 59 NaN 4 0 W 4
4 Fri, Nov 26, 2021 Dayton L 73 74 NaN 4 1 L 1
5 Sun, Nov 28, 2021 Iona W 96 83 NaN 5 1 W 1
6 Fri, Dec 3, 2021 St. John's (NY) W 95 75 NaN 6 1 W 2
7 Tue, Dec 7, 2021 UTEP W 78 52 NaN 7 1 W 3
8 Sat, Dec 11, 2021 Missouri W 102 65 NaN 8 1 W 4
9 Sat, Dec 18, 2021 Stephen F. Austin W 80 72 NaN 9 1 W 5
10 Wed, Dec 29, 2021 Nevada W 88 61 NaN 10 1 W 6
11 Sat, Jan 1, 2022 George Mason W 76 67 NaN 11 1 W 7
12 Tue, Jan 4, 2022 Oklahoma State W 74 63 NaN 12 1 W 8
13 Sat, Jan 8, 2022 Texas Tech (25) L 67 75 NaN 12 2 L 1
14 Tue, Jan 11, 2022 Iowa State (15) W 62 61 NaN 13 2 W 1
15 Sat, Jan 15, 2022 West Virginia W 85 59 NaN 14 2 W 2
16 Tue, Jan 18, 2022 Oklahoma W 67 64 NaN 15 2 W 3
17 Sat, Jan 22, 2022 Kansas State W 78 75 NaN 16 2 W 4
18 Mon, Jan 24, 2022 Texas Tech (13) W 94 91 2OT 17 2 W 5
19 Sat, Jan 29, 2022 Kentucky (12) L 62 80 NaN 17 3 L 1
20 Tue, Feb 1, 2022 Iowa State (20) W 70 61 NaN 18 3 W 1
21 Sat, Feb 5, 2022 Baylor (8) W 83 59 NaN 19 3 W 2
22 Mon, Feb 7, 2022 Texas (20) L 76 79 NaN 19 4 L 1
23 Sat, Feb 12, 2022 Oklahoma W 71 69 NaN 20 4 W 1
24 Mon, Feb 14, 2022 Oklahoma State W 76 62 NaN 21 4 W 2
25 Sat, Feb 19, 2022 West Virginia W 71 58 NaN 22 4 W 3
26 Tue, Feb 22, 2022 Kansas State W 102 83 NaN 23 4 W 4
27 Sat, Feb 26, 2022 Baylor (10) L 70 80 NaN 23 5 L 1
28 Tue, Mar 1, 2022 TCU L 64 74 NaN 23 6 L 2
29 Thu, Mar 3, 2022 TCU W 72 68 NaN 24 6 W 1
30 Sat, Mar 5, 2022 Texas (21) W 70 63 OT 25 6 W 2
32 Thu, Mar 10, 2022 West Virginia W 87 63 NaN 26 6 W 3
33 Fri, Mar 11, 2022 TCU W 75 62 NaN 27 6 W 4
34 Sat, Mar 12, 2022 Texas Tech (14) W 74 65 NaN 28 6 W 5
36 Thu, Mar 17, 2022 Texas Southern W 83 56 NaN 29 6 W 6
37 Sat, Mar 19, 2022 Creighton W 79 72 NaN 30 6 W 7
38 Fri, Mar 25, 2022 Providence (13) W 66 61 NaN 31 6 W 8
39 Sun, Mar 27, 2022 Miami (FL) W 76 50 NaN 32 6 W 9
40 Sat, Apr 2, 2022 Villanova (6) W 81 65 NaN 33 6 W 10
41 Mon, Apr 4, 2022 North Carolina W 72 69 NaN 34 6 W 11
unc_url = 'https://www.sports-reference.com/cbb/schools/north-carolina/men/2022-schedule.html'

unc_schedule = pd.read_html(unc_url, header=0)[1]
unc_schedule = unc_schedule[~unc_schedule['Date'].str.contains('date', case=False, na=False)]
unc_schedule = unc_schedule[columns_keep]
unc_schedule
Date Opponent Unnamed: 8 Tm Opp OT W L Streak
0 Tue, Nov 9, 2021 Loyola (MD) W 83 67 NaN 1 0 W 1
1 Fri, Nov 12, 2021 Brown W 94 87 NaN 2 0 W 2
2 Tue, Nov 16, 2021 College of Charleston W 94 83 NaN 3 0 W 3
3 Sat, Nov 20, 2021 Purdue (6) L 84 93 NaN 3 1 L 1
4 Sun, Nov 21, 2021 Tennessee (17) L 72 89 NaN 3 2 L 2
5 Tue, Nov 23, 2021 UNC Asheville W 72 53 NaN 4 2 W 1
6 Wed, Dec 1, 2021 Michigan (24) W 72 51 NaN 5 2 W 2
7 Sun, Dec 5, 2021 Georgia Tech W 79 62 NaN 6 2 W 3
8 Sat, Dec 11, 2021 Elon W 80 63 NaN 7 2 W 4
9 Tue, Dec 14, 2021 Furman W 74 61 NaN 8 2 W 5
10 Sat, Dec 18, 2021 Kentucky (21) L 69 98 NaN 8 3 L 1
11 Tue, Dec 21, 2021 Appalachian State W 70 50 NaN 9 3 W 1
12 Sun, Jan 2, 2022 Boston College W 91 65 NaN 10 3 W 2
13 Wed, Jan 5, 2022 Notre Dame L 73 78 NaN 10 4 L 1
14 Sat, Jan 8, 2022 Virginia W 74 58 NaN 11 4 W 1
15 Sat, Jan 15, 2022 Georgia Tech W 88 65 NaN 12 4 W 2
16 Tue, Jan 18, 2022 Miami (FL) L 57 85 NaN 12 5 L 1
17 Sat, Jan 22, 2022 Wake Forest L 76 98 NaN 12 6 L 2
18 Mon, Jan 24, 2022 Virginia Tech W 78 68 NaN 13 6 W 1
19 Wed, Jan 26, 2022 Boston College W 58 47 NaN 14 6 W 2
20 Sat, Jan 29, 2022 NC State W 100 80 NaN 15 6 W 3
21 Tue, Feb 1, 2022 Louisville W 90 83 OT 16 6 W 4
22 Sat, Feb 5, 2022 Duke (9) L 67 87 NaN 16 7 L 1
23 Tue, Feb 8, 2022 Clemson W 79 77 NaN 17 7 W 1
24 Sat, Feb 12, 2022 Florida State W 94 74 NaN 18 7 W 2
25 Wed, Feb 16, 2022 Pittsburgh L 67 76 NaN 18 8 L 1
26 Sat, Feb 19, 2022 Virginia Tech W 65 57 NaN 19 8 W 1
27 Mon, Feb 21, 2022 Louisville W 70 63 NaN 20 8 W 2
28 Sat, Feb 26, 2022 NC State W 84 74 NaN 21 8 W 3
29 Mon, Feb 28, 2022 Syracuse W 88 79 OT 22 8 W 4
30 Sat, Mar 5, 2022 Duke (4) W 94 81 NaN 23 8 W 5
32 Thu, Mar 10, 2022 Virginia W 63 43 NaN 24 8 W 6
33 Fri, Mar 11, 2022 Virginia Tech L 59 72 NaN 24 9 L 1
35 Thu, Mar 17, 2022 Marquette W 95 63 NaN 25 9 W 1
36 Sat, Mar 19, 2022 Baylor (4) W 93 86 OT 26 9 W 2
37 Fri, Mar 25, 2022 UCLA (11) W 73 66 NaN 27 9 W 3
38 Sun, Mar 27, 2022 Saint Peter's W 69 49 NaN 28 9 W 4
39 Sat, Apr 2, 2022 Duke (9) W 81 77 NaN 29 9 W 5
40 Mon, Apr 4, 2022 Kansas (3) L 69 72 NaN 29 10 L 1
import matplotlib.pyplot as plt

rankings_transpose.plot(kind='line', marker='o')

# Adding labels and title
plt.xlabel('Time')
plt.ylabel('Values')
plt.title('Line Graph of Team Values Over Time')

# Display the plot
plt.legend(bbox_to_anchor=(1.1, 1.05))
plt.show()



fig = px.line(rankings_transpose, 
              x=rankings_transpose.index, 
              y="Kansas",
              title="Weekly Ranking of Kansas",
              labels={"index":"Week of Ranking", "Kansas":"Ranking"})


fig.show()

fig = px.line(rankings_transpose, 
              x=rankings_transpose.index, 
              y="UNC",
              title="Weekly Ranking of UNC",
              labels={"index":"Week of Ranking", "Kansas":"Ranking"})


fig.show()

Above, we see what the external data looks like, and we also see how our individual teams fared throughout the season. The weekly rankings show a lot of change, and even when we zoom in on our two teams, there is a lot of movement as well (or, in the case of UNC, they are barely ranked!). This will help inform some of our other analysis as we continue.